UNIFICATION OF STATISTICAL METHODS FOR
CONTINUOUS AND DISCRETE DATA
by  Emanuel  Parzen 

Department of Statistics, Texas A&M University¹


0.  Introduction 

This  paper  introduces  notation  and  concepts  which  establish  unity  and  analogues 
between  various  steps  of  statistical  data  analysis,  estimation,  and  hypothesis  testing  by 
expressing  them  in  terms  of  optimization  and  function  approximation  using  information 
criteria  to  compare  two  distributions.  The  contents  may  be  described  as  composed  of  two 
parts  whose  section  titles  are  as  follows. 

Part  I.  Statistical  Information  Mathematics  and  Comparison  Density  Functions. 

1.  Traditional  Entropy  and  Cross-Entropy 

2.  Comparison  Density  Functions 

3.  Renyi  Information  Approximation 

4.  Chi-square  Information  Divergence 

Part II. Comparison Density Approach to Unity of Statistical Methods

5.  One  Sample  Continuous  Data  Analysis 

6.  One  Sample  Discrete  Data  Analysis 

7.  Multi-sample  Data  Analysis  and  Tests  of  Homogeneity 

8.  Bivariate  Data  Analysis 

9.  Examples  of  One  Sample  and  Multi-Sample  Continuous  Data  Analysis 

1.  Traditional  Entropy  and  Cross-Entropy 

The (Kullback-Leibler) information divergence between two probability distributions
F and G is defined by

\[ I(F;G) = (-2) \int_{-\infty}^{\infty} \log\{g(x)/f(x)\}\, f(x)\, dx \]

when F and G are continuous with probability density functions $f(x)$ and $g(x)$; when F and
G are discrete, with probability mass functions $p_F(x)$ and $p_G(x)$, information divergence
is defined by

\[ I(F;G) = (-2) \sum_x \log\{p_G(x)/p_F(x)\}\, p_F(x). \]

¹Research supported by the U.S. Army Research Office.

An information decomposition of information divergence is

\[ I(F;G) = H(F;G) - H(F), \]

in terms of entropy $H(F)$ and cross-entropy $H(F;G)$; our definitions differ from the usual
definitions by a factor of 2:

\[ H(F) = (-2) \int_{-\infty}^{\infty} \{\log f(x)\}\, f(x)\, dx, \]

\[ H(F;G) = (-2) \int_{-\infty}^{\infty} \{\log g(x)\}\, f(x)\, dx. \]
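As a small numerical illustration of the decomposition $I(F;G) = H(F;G) - H(F)$, the following sketch evaluates the discrete versions of the three quantities, keeping the factor of 2 used in this paper. The two probability mass functions are hypothetical values chosen only for the example, and the function names are not from the paper.

```python
import math

def cross_entropy(p_f, p_g):
    """H(F;G) = -2 * sum_x p_F(x) log p_G(x) (factor-2 convention of the paper)."""
    return -2.0 * sum(pf * math.log(pg) for pf, pg in zip(p_f, p_g) if pf > 0)

def entropy(p_f):
    """H(F) = H(F;F)."""
    return cross_entropy(p_f, p_f)

def information_divergence(p_f, p_g):
    """I(F;G) = -2 * sum_x p_F(x) log{p_G(x)/p_F(x)}."""
    return -2.0 * sum(pf * math.log(pg / pf) for pf, pg in zip(p_f, p_g) if pf > 0)

# Hypothetical probability mass functions on a common three-point support.
p_f = [0.5, 0.3, 0.2]
p_g = [0.4, 0.4, 0.2]

i_fg = information_divergence(p_f, p_g)
decomposed = cross_entropy(p_f, p_g) - entropy(p_f)
```

Both `i_fg` and `decomposed` should agree up to rounding, and the divergence of a distribution from itself is zero.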

2.  Comparison  Density  Functions 

Information divergence $I(F;G)$ is a concept that works for both multivariate and univariate
distributions. This paper proposes that the univariate case is distinguished by the
fact that we are able to relate $I(F;G)$ to the concept of comparison density $d(u;F,G)$, whose
maximum entropy estimation provides significant extensions of information divergence.

Quantile domain concepts play a central role; $Q(u) = F^{-1}(u)$ is the quantile function.
When F is continuous, we define the density quantile function $fQ(u) = f(Q(u))$, score
function $J(u) = -(fQ(u))'$, and quantile density function

\[ q(u) = 1/fQ(u) = Q'(u). \]

When F is discrete, we define $fQ(u) = p_F(Q(u))$, $q(u) = 1/fQ(u)$.

The comparison density $d(u;F,G)$ is defined as follows: when F and G are both
continuous,

\[ d(u;F,G) = g(F^{-1}(u))/f(F^{-1}(u)); \]

when F and G are both discrete,

\[ d(u;F,G) = p_G(F^{-1}(u))/p_F(F^{-1}(u)). \]

In the continuous case $d(u;F,G)$ is the derivative of

\[ D(u;F,G) = G(F^{-1}(u)); \]

in the discrete case we define

\[ D(u;F,G) = \int_0^u d(t;F,G)\, dt. \]

Let F denote the true distribution function of a continuous random variable Y. To test
the goodness of fit hypothesis $H_0: F = G$, one transforms to $W = G(Y)$, whose distribution
function is $F(G^{-1}(u))$ and whose quantile function is $G(F^{-1}(u))$. The comparison densities
$d(u;F,G)$ and $d(u;G,F)$ are respectively the quantile density and the probability density
of W.
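For discrete distributions the comparison density is a step function on (0,1), which makes it easy to compute. The sketch below is a minimal illustration, assuming two hypothetical probability mass functions on a common ordered support with $p_F(x) > 0$ everywhere; the function names are illustrative only.

```python
import bisect
from itertools import accumulate

def comparison_density(p_f, p_g):
    """Return the step function d(u) = p_G(F^{-1}(u)) / p_F(F^{-1}(u))."""
    cum_f = list(accumulate(p_f))                    # F(x_1), F(x_2), ...
    heights = [pg / pf for pf, pg in zip(p_f, p_g)]
    def d(u):
        # F^{-1}(u) is the first support point x_j with F(x_j) >= u
        j = bisect.bisect_left(cum_f, u)
        return heights[min(j, len(heights) - 1)]
    return d

def comparison_distribution(p_f, p_g, u):
    """D(u) = integral of d(t) over 0 < t < u (piecewise linear in u)."""
    total, left = 0.0, 0.0
    for pf, pg in zip(p_f, p_g):
        right = left + pf
        overlap = max(0.0, min(u, right) - left)     # part of (left, right) below u
        total += overlap * pg / pf
        left = right
    return total

p_f = [0.5, 0.3, 0.2]   # hypothetical pmf of F
p_g = [0.4, 0.4, 0.2]   # hypothetical pmf of G
d = comparison_density(p_f, p_g)
```

At the jump points $u = F(x_j)$ the function `comparison_distribution` returns $G(x_j)$, so D interpolates linearly between the points of a P-P plot, and it reaches 1 at $u = 1$.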


3.  Renyi  Information  Approximation 

For a density $d(u)$, $0 < u < 1$, Renyi information (of index $\lambda$), denoted $\mathrm{IR}_\lambda(d)$, is
non-negative and measures the divergence of $d(u)$ from the uniform density $d_0(u) = 1$, $0 < u < 1$.
It is defined by

\[ \mathrm{IR}_0(d) = 2 \int_0^1 \{d(u) \log d(u)\}\, du; \]

\[ \mathrm{IR}_{-1}(d) = -2 \int_0^1 \{\log d(u)\}\, du; \]

for $\lambda \ne 0$ or $-1$

\[ \mathrm{IR}_\lambda(d) = \frac{2}{\lambda(1+\lambda)} \log \int_0^1 \{d(u)\}^{1+\lambda}\, du. \]

To relate comparison density to information divergence we use the concept of Renyi
information $\mathrm{IR}_\lambda$, which yields the important identity (and interpretation of $I(F;G)$!)

\[ I(F;G) = (-2) \int_0^1 \log d(u;F,G)\, du = \mathrm{IR}_{-1}(d(u;F,G)) = \mathrm{IR}_0(d(u;G,F)). \]


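Interchanging F and G: One can prove a basic identity:

\[ \mathrm{IR}_\lambda(d(u;F,G)) = \mathrm{IR}_{-(1+\lambda)}(d(u;G,F)). \]

Note $\lambda = -(1+\lambda)$ for $\lambda = -.5$. Hellinger information divergence is

\[ \mathrm{IR}_{-.5}(d) = -8 \log \int_0^1 \{d(u)\}^{.5}\, du. \]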
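The interchange identity can be checked numerically when F and G are discrete, because the comparison density is then a step function and the integrals reduce to finite sums. This is only a sketch under that assumption; the probability mass functions and function names are hypothetical.

```python
import math

def renyi_information(heights, widths, lam):
    """IR_lambda(d) for a step-function density d on (0,1)."""
    if lam == 0:
        return 2.0 * sum(w * h * math.log(h) for h, w in zip(heights, widths))
    if lam == -1:
        return -2.0 * sum(w * math.log(h) for h, w in zip(heights, widths))
    integral = sum(w * h ** (1.0 + lam) for h, w in zip(heights, widths))
    return 2.0 / (lam * (1.0 + lam)) * math.log(integral)

def comparison_steps(p_f, p_g):
    """Heights p_G/p_F and interval widths p_F of the step function d(u; F, G)."""
    return [pg / pf for pf, pg in zip(p_f, p_g)], list(p_f)

p_f = [0.5, 0.3, 0.2]   # hypothetical pmf of F
p_g = [0.4, 0.4, 0.2]   # hypothetical pmf of G
h_fg, w_fg = comparison_steps(p_f, p_g)
h_gf, w_gf = comparison_steps(p_g, p_f)

# Hellinger case: lambda = -.5 is its own partner under lambda -> -(1 + lambda).
hellinger = renyi_information(h_fg, w_fg, -0.5)
```

For each index, `renyi_information(h_fg, w_fg, lam)` should match `renyi_information(h_gf, w_gf, -(1 + lam))`, and the index $-1$ value is the information divergence $I(F;G)$.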


Minimizing $\mathrm{IR}_\lambda(d)$ subject to constraints on d is equivalent, for $\lambda > 0$, to minimizing
the $L_p$ norm of d for $p = 1 + \lambda$; we can apply the mathematical theory of this problem, which
is currently being developed (Chui, Deutsch, Ward (1990)). Note that the $L_2$ norm corresponds to
$\lambda = 1$. The minimizing function $\hat d$ will satisfy $\mathrm{IR}_\lambda(\hat d) \le \mathrm{IR}_\lambda(d)$.

Convergence Lemma. If $d_m(u)$ is a sequence of densities and $\lambda > 0$, then
$\mathrm{IR}_\lambda(d_m(u))$ converging to 0 implies that $\int_0^1 |d_m(u) - 1|\, du$ converges to zero.

Approximation Theory. To a density $d(u)$, $0 < u < 1$, approximating functions are
defined by constraining (specifying) the inner product between $d(u)$ and a specified function
$J(u)$, called a score function. We often assume that the integral over (0,1) of $J(u)$ is zero,
and the integral of $J^2(u)$ is finite. A score function $J(u)$, $0 < u < 1$, is always defined to
have the property that its inner product with $d(u)$, denoted

\[ [J,d] = [J(u),d(u)] = \int_0^1 J(u)\, d(u)\, du, \]

is finite. The inner product is called a component or linear detector; its value is a measure
of the difference between $d(u)$ and 1.

The question of which distributions to choose as F and G is often resolved by the
following formula, which evaluates the inner product between $J(u)$ and $d(u;F,G)$ as a
moment with respect to G if $J(u) = \varphi(F^{-1}(u))$:

\[ [J(u), d(u;F,G)] = \int_{-\infty}^{\infty} \varphi(y)\, dG(y) = E_G[\varphi(Y)]. \]


Often  G  is  a  raw  sample  distribution  and  F  is  a  smooth  distribution  which  is  a  model 
for  G  according  to  the  hypothesis  being  tested. 
Approximations in $L_2$ norm are based on a sequence $J_k(u)$, $k = 1, 2, \ldots$, which is
a complete orthonormal set of functions. Then if $d(u)$, $0 < u < 1$, is square integrable
(equivalently, $\mathrm{IR}_1(d)$ is finite) one can represent $d(u)$ as the limit of

\[ d_m(u) = 1 + \sum_{k=1}^{m} [J_k, d]\, J_k(u), \quad m = 1, 2, \ldots. \]

When $\varphi_k(y)$, $k = 1, 2, \ldots$, is a complete orthonormal set for $L_2(F)$, $g(y)$ is approximated
by

\[ g_m(y) = f(y) \Big\{ 1 + \sum_{k=1}^{m} E_G[\varphi_k(Y)]\, \varphi_k(y) \Big\}. \]

We call $d_m(u)$ a truncated orthogonal function (generalized Fourier) series.

An important general method of density approximation, called a weighted orthogonal
function approximation, is to use suitable weights $w_k$ to form approximations

\[ d^*(u) = 1 + \sum_{k=1}^{\infty} w_k\, [J_k, d]\, J_k(u) \]

to $d(u)$. Often $w_k$ depends on a "truncation point" m, and $w_k \to 1$ as $m \to \infty$.
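A truncated orthogonal series can be sketched in a few lines once components are available. The example below assumes the cosine score functions $J_k(u) = \sqrt{2}\cos(k\pi u)$ (one possible complete orthonormal choice) and estimates components as sample means of $J_k(W_i)$ for values $W_i$ in (0,1), following the moment formula $[J, d(u;F,G)] = E_G[\varphi(Y)]$. The "samples" are deterministic, evenly spread points rather than real data, and all names are illustrative.

```python
import numpy as np

def cosine_scores(u, m):
    """Orthonormal score functions J_k(u) = sqrt(2) cos(k pi u), k = 1..m."""
    u = np.asarray(u, float)
    return np.array([np.sqrt(2.0) * np.cos(k * np.pi * u) for k in range(1, m + 1)])

def sample_components(w, m):
    """Raw components: sample means of J_k(W_i), with W_i = F(Y_i) in (0,1)."""
    return cosine_scores(w, m).mean(axis=1)

def series_estimate(components, u):
    """Truncated series d_m(u) = 1 + sum_k [J_k, d] J_k(u)."""
    c = np.asarray(components, float)
    return 1.0 + c @ cosine_scores(u, len(c))

grid = (np.arange(2000) + 0.5) / 2000
w_uniform = grid             # values spread evenly on (0,1): comparison density is 1
w_skewed = np.sqrt(grid)     # evenly spread points pushed through the quantile function of d(u) = 2u

c_uniform = sample_components(w_uniform, 4)
c_skewed = sample_components(w_skewed, 4)
d4_skewed = series_estimate(c_skewed, grid)
```

For the evenly spread values all components vanish and the series reduces to the uniform density; for the skewed values the first component is close to the exact value $-4\sqrt{2}/\pi^2$ obtained by integrating $\sqrt{2}\cos(\pi u)\cdot 2u$.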

We propose that non-parametric statistical inference and density estimation can be
based on the same criterion functions used for parametric inference if one uses the minimum
Renyi information approach to density estimation (which extends the maximum entropy
approach): form functions $\hat d_{\lambda,m}(u)$ which minimize $\mathrm{IR}_\lambda(\hat d(u))$ among all functions $\hat d(u)$
satisfying the constraints

\[ [J_k, \hat d] = [J_k, d] \quad \text{for } k = 1, \ldots, m, \]

where $J_k(u)$ are specified score functions. One expects $\hat d_{\lambda,m}(u)$ to converge to $d(u)$ as m
tends to $\infty$, and $\mathrm{IR}_\lambda(\hat d_{\lambda,m})$ to non-decreasingly converge to $\mathrm{IR}_\lambda(d)$.

Quadratic Detectors. To test $H_0: d(u) = 1$, $0 < u < 1$, many traditional goodness of
fit test statistics (such as Cramer-von Mises and Anderson-Darling) can be expressed as
quadratic detectors

\[ \sum_{k=1}^{\infty} \{w_k [J_k, d]\}^2 = \int_0^1 \{d^*(u) - 1\}^2\, du
   = \int_0^1 \{d^*(u)\}^2\, du - 1 = -1 + \exp \mathrm{IR}_1(d^*). \]

We propose that these nonadaptive test statistics are only of historical interest since they
are not as powerful as minimum Renyi information detectors $\mathrm{IR}_\lambda(\hat d_{\lambda,m})$; in addition the
latter provide unification of statistical methods.

Maximum entropy approximators correspond to $\lambda = 0$; $\hat d_{0,m}(u)$ satisfies an exponential
model (whose parameters are denoted $\theta_1, \ldots, \theta_m$)

\[ \log \hat d_{0,m}(u) = \sum_{k=1}^{m} \theta_k J_k(u) - \Psi(\theta_1, \ldots, \theta_m), \]

where $\Psi$ is the integrating factor that guarantees that $\hat d_{0,m}(u)$, $0 < u < 1$, integrates to 1:

\[ \Psi(\theta_1, \ldots, \theta_m) = \log \int_0^1 \exp\Big\{ \sum_{k=1}^{m} \theta_k J_k(u) \Big\}\, du. \]

The approximating functions formed in practice are not computed from the true components
$[J_k, d]$ but from raw estimators $[J_k, \tilde d]$ for suitable raw estimators $\tilde d(u)$. The
approximating functions are interpreted as estimators of a true density. Methods proposed
for unification and generalization of statistical methods use minimum Renyi information
estimation techniques. Different applications of these methods differ mainly in how they
define the raw density $\tilde d(u)$ which is the starting point of the data analysis.
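One way to compute a maximum entropy approximator is to solve for the exponential model parameters by Newton's method, since the constraints $[J_k, \hat d] = c_k$ are the stationarity conditions of the convex function $\Psi(\theta) - \sum_k \theta_k c_k$. The sketch below is not the paper's algorithm; it assumes orthonormal shifted Legendre score functions, midpoint-rule quadrature on a grid, and hypothetical target components, and it omits step-size safeguards that would be needed for components near the boundary of the feasible set.

```python
import numpy as np
from numpy.polynomial.legendre import Legendre

def legendre_scores(u, m):
    """Orthonormal score functions J_k(u) = sqrt(2k+1) P_k(2u-1), k = 1..m."""
    x = 2.0 * np.asarray(u, float) - 1.0
    return np.array([np.sqrt(2 * k + 1) * Legendre.basis(k)(x) for k in range(1, m + 1)])

def fit_max_entropy(components, n_grid=2000, n_iter=50, tol=1e-12):
    """Newton iteration for theta in log d(u) = sum_k theta_k J_k(u) - Psi(theta)."""
    c = np.asarray(components, float)
    u = (np.arange(n_grid) + 0.5) / n_grid          # midpoint quadrature nodes on (0,1)
    J = legendre_scores(u, len(c))                  # shape (m, n_grid)
    theta = np.zeros(len(c))
    for _ in range(n_iter):
        w = np.exp(theta @ J)
        d = w / w.mean()                            # current density, integrates to 1 on the grid
        mean = (J * d).mean(axis=1)                 # current components [J_k, d]
        grad = mean - c                             # gradient of Psi(theta) - theta . c
        if np.max(np.abs(grad)) < tol:
            break
        cov = (J * d) @ J.T / n_grid - np.outer(mean, mean)   # Hessian of Psi
        theta = theta - np.linalg.solve(cov, grad)
    w = np.exp(theta @ J)
    psi = np.log(w.mean())                          # integrating factor Psi(theta)
    return theta, psi, u, w / w.mean()

target = np.array([0.3, 0.1])                       # hypothetical raw components
theta, psi, u, d_hat = fit_max_entropy(target)
ir0 = 2.0 * np.mean(d_hat * np.log(d_hat))          # IR_0 of the fitted density
```

At the solution, $\mathrm{IR}_0(\hat d_{0,m}) = 2\{\sum_k \theta_k c_k - \Psi(\theta)\}$, which gives a convenient check on the fit.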

4. Chi-square Information Divergence

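In addition to Renyi information divergence (an extension of information statistics)
one needs to use an extension of chi-square statistics which has been developed by Read
and Cressie (1988). Chi-square divergence of index $\lambda$ is defined for continuous
F and G by

\[ C_\lambda(F;G) = \int_{-\infty}^{\infty} B_\lambda\big(g(y)/f(y)\big)\, f(y)\, dy, \]

where, for $\lambda \ne 0, -1$,

\[ B_\lambda(d) = \frac{2}{\lambda(1+\lambda)} \big\{ d^{1+\lambda} - (1+\lambda)(d-1) - 1 \big\}, \]

\[ B_0(d) = 2\{d \log d - d + 1\}, \]

\[ B_{-1}(d) = -2\{\log d - d + 1\}. \]

Important properties of $B_\lambda(d)$ are:

\[ B_\lambda'(d) = \frac{2}{\lambda}(d^\lambda - 1), \qquad B_\lambda''(d) = 2 d^{\lambda-1}; \]

\[ B_1(d) = (d-1)^2 \]
\[ B_0(d) = 2(d \log d - d + 1) \]
\[ B_{-.5}(d) = 4(d^{.5} - 1)^2 \]
\[ B_{-1}(d) = -2(\log d - d + 1) \]
\[ B_{-2}(d) = d(d^{-1} - 1)^2 \]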

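Renyi information of index $\lambda$ is defined for continuous F and G: for $\lambda \ne 0, -1$,

\[ \mathrm{IR}_\lambda(F;G) = \frac{2}{\lambda(1+\lambda)} \log \int_{-\infty}^{\infty} \Big\{ \frac{g(y)}{f(y)} \Big\}^{1+\lambda} f(y)\, dy, \]

\[ \mathrm{IR}_0(F;G) = 2 \int_{-\infty}^{\infty} \Big\{ \frac{g(y)}{f(y)} \log \frac{g(y)}{f(y)} \Big\} f(y)\, dy, \]

\[ \mathrm{IR}_{-1}(F;G) = -2 \int_{-\infty}^{\infty} \Big\{ \log \frac{g(y)}{f(y)} \Big\} f(y)\, dy. \]

An analogous definition holds for discrete F and G.

The Renyi information and chi-square divergence measures are related:

\[ \mathrm{IR}_0(F;G) = C_0(F;G), \]

\[ \mathrm{IR}_{-1}(F;G) = C_{-1}(F;G). \]

For $\lambda \ne 0, -1$,

\[ \mathrm{IR}_\lambda(F;G) = \frac{2}{\lambda(1+\lambda)} \log \Big\{ 1 + \frac{\lambda(1+\lambda)}{2}\, C_\lambda(F;G) \Big\}. \]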

Interchange of F and G is provided by the Lemma:

\[ C_\lambda(F;G) = C_{-(1+\lambda)}(G;F), \]

\[ \mathrm{IR}_\lambda(F;G) = \mathrm{IR}_{-(1+\lambda)}(G;F). \]

For a density $d(u)$, $0 < u < 1$, define

\[ C_\lambda(d) = \int_0^1 B_\lambda(d(u))\, du. \]

The comparison density again unifies the continuous and discrete cases. One can show
that for univariate F and G

\[ C_\lambda(F;G) = C_\lambda(d(u;F,G)). \]
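The special cases of $B_\lambda$ and the interchange lemma are easy to confirm numerically. The following sketch assumes the general form $B_\lambda(d) = \{2/\lambda(1+\lambda)\}\{d^{1+\lambda} - (1+\lambda)(d-1) - 1\}$, which reproduces each listed special case, and uses hypothetical discrete distributions; the function names are illustrative.

```python
import math

def b_lambda(d, lam):
    """Read-Cressie type function B_lambda(d), with the limits at lambda = 0 and -1."""
    if lam == 0:
        return 2.0 * (d * math.log(d) - d + 1.0)
    if lam == -1:
        return -2.0 * (math.log(d) - d + 1.0)
    return 2.0 / (lam * (1.0 + lam)) * (d ** (1.0 + lam) - (1.0 + lam) * (d - 1.0) - 1.0)

def chi_square_divergence(p_f, p_g, lam):
    """C_lambda(F;G) = sum_x B_lambda(p_G(x)/p_F(x)) p_F(x) for discrete F and G."""
    return sum(pf * b_lambda(pg / pf, lam) for pf, pg in zip(p_f, p_g))

def renyi_from_chi_square(c, lam):
    """IR_lambda recovered from C_lambda for lambda not equal to 0 or -1."""
    k = lam * (1.0 + lam) / 2.0
    return math.log(1.0 + k * c) / k

p_f = [0.5, 0.3, 0.2]   # hypothetical pmf of F
p_g = [0.4, 0.4, 0.2]   # hypothetical pmf of G
c_two_thirds = chi_square_divergence(p_f, p_g, 2.0 / 3.0)
```

With index 1 the divergence is the familiar sum of $(p_G - p_F)^2/p_F$, and swapping F and G while replacing $\lambda$ by $-(1+\lambda)$ leaves the value unchanged.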

5.  One  Sample  Continuous  Data  Analysis 

We now apply statistical information mathematics to describe a unified approach to
one sample continuous data analysis which uses optimization and approximation based on
information criteria to develop methods which are simultaneously parametric, nonparametric,
maximum entropy nonparametric, estimation, testing parametric hypotheses, and
goodness of fit of parametric model. Let $Y_1, \ldots, Y_n$ be a random sample of a continuous
random variable Y with true unknown distribution F and sample distribution $\tilde F$.

A parametric model $F(x;\theta)$ for F assumes that the true probability density function
belongs to a parametric family $f(x;\theta)$ with distribution function $F(x;\theta)$. The maximum
likelihood estimator $\hat\theta$ minimizes

\[ I(\tilde F; F(\cdot;\theta)) = \mathrm{IR}_{-1}(d(u; \tilde F, F(\cdot;\theta))). \]

To prove the proposition, we denote by $L(\theta)$ twice the average log likelihood function:

\[ L(\theta) = (2/n) \log f(Y_1, \ldots, Y_n; \theta)
   = 2 \tilde E[\log f(Y;\theta)]
   = 2 \int_{-\infty}^{\infty} \log f(y;\theta)\, d\tilde F(y). \]

To maximize likelihood we express it as minus cross-entropy:

\[ L(\theta) = -H(\tilde F; F(\cdot;\theta)). \]
Temporarily assuming away the fact that $\tilde F$ has only a symbolic density $\tilde f$, the maximum
likelihood estimator $\hat\theta$ can be regarded as minimizing over $\theta$

\[ I(\tilde F; F(\cdot;\theta)) = H(\tilde F; F(\cdot;\theta)) - H(\tilde F). \]

$\hat\theta$ may be interpreted as the parameter value $\theta$ for which the sample quantile function
$F(\tilde F^{-1}(u);\theta)$ of the transformed variable $W_\theta = F(Y;\theta)$ is closest to uniform. Traditional
goodness of fit statistics test how close to uniform is the sample distribution
function of $W_{\hat\theta}$, whose symbolic probability density is a raw estimator of

\[ d(u; F(\cdot;\hat\theta), F). \]
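The equivalence between maximizing likelihood and minimizing cross-entropy can be illustrated with a normal model. The sketch below uses a small hypothetical sample and compares the cross-entropy at the maximum likelihood estimates with its value at perturbed parameters; the function name is illustrative.

```python
import math

def cross_entropy_normal(sample, mu, sigma):
    """H(F~; F(.;theta)) = -(2/n) sum_i log f(Y_i; mu, sigma) for a normal model."""
    n = len(sample)
    log_lik = sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
                  - (y - mu) ** 2 / (2 * sigma ** 2) for y in sample)
    return -2.0 * log_lik / n

sample = [9.2, 10.1, 10.4, 9.8, 11.0, 10.3, 9.6]   # hypothetical observations

# Maximum likelihood estimates for the normal model (divisor n, not n - 1).
mu_hat = sum(sample) / len(sample)
sigma_hat = math.sqrt(sum((y - mu_hat) ** 2 for y in sample) / len(sample))

best = cross_entropy_normal(sample, mu_hat, sigma_hat)
perturbed = [cross_entropy_normal(sample, mu_hat + dm, sigma_hat * ds)
             for dm in (-0.2, 0.0, 0.2) for ds in (0.8, 1.0, 1.25)
             if not (dm == 0.0 and ds == 1.0)]
```

For the normal model the minimized value has the closed form $\log(2\pi\hat\sigma^2) + 1$, and every perturbed parameter value gives a larger cross-entropy.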

Outline of statistical reasoning: We propose that the various steps of statistical reasoning
compose 4 actions which are the goals of statistical science:

1. Make observations (step 0) and summarize by $\tilde F$; 2. Form expectations (steps 1
and 2) which is a parametric model for the observations expressed by $F(\cdot;\hat\theta)$; 3. Compare
observations and expectations (steps 3 and 4); 4. Revise model to fit observations (steps
5 and 6). The revised model is equivalent to a nonparametric estimator $F^*$.

Step  0.  Observations.  The  sample  is  summarized  by  its  sample  distribution  function 
F~  and  its  sample  quantile  function  Q~. 

Step 1: Parametric Model Specification. Using diagnostic tools (such as the identification
quantile function) identify a parametric family $F(x;\theta)$ such that for all $\theta$

\[ \sup_u d(u; F(\cdot;\theta), F) < \infty. \]

Step 2: Parameter estimation. Maximum likelihood estimator $\hat\theta$ can be obtained by
minimizing

\[ \mathrm{IR}_{-1}(d(u; \tilde F, F(\cdot;\theta))). \]

A parametric estimator of F is $\hat F(x) = F(x;\hat\theta)$.

Step 2*: Robust parameter estimators $\hat\theta_\lambda$ can be obtained by minimizing

\[ \mathrm{IR}_\lambda(d(u; F^*, F(\cdot;\theta))) \]

for a suitable smooth non-parametric distribution function estimator $F^*$ and suitable values
of $\lambda$, usually chosen in the interval $-1 < \lambda < 0$.

Step 3: Parametric hypothesis testing. To test a hypothesis $H_0$ about the parameter $\theta$,
let $\hat\theta_{H_0}$ denote the maximum likelihood estimator of $\theta$ under $H_0$; equivalent to likelihood
ratio tests is the test statistic

\[ \mathrm{IR}_{-1}(d(u; \tilde F, F(\cdot;\hat\theta_{H_0}))) - \mathrm{IR}_{-1}(d(u; \tilde F, F(\cdot;\hat\theta))). \]

Step 4: Goodness of fit test of $H_0: F = F(\cdot;\hat\theta)$ or equivalently $H_0: d(u; F(\cdot;\hat\theta), F) = 1$.
Test the significance of the difference from zero of

\[ \mathrm{IR}_0(d(u; F(\cdot;\hat\theta), \tilde F)) = \mathrm{IR}_{-1}(d(u; \tilde F, F(\cdot;\hat\theta))). \]

Step 5: Maximum entropy goodness of fit tests and estimators $\hat d_{0,m}(u)$ of
$d(u; F(\cdot;\hat\theta), F)$ are obtained by minimizing $\mathrm{IR}_0(\hat d)$ among densities $\hat d(u)$ satisfying, for
$k = 1, \ldots, m$ and specified score functions $J_k(u)$,

\[ [J_k, \hat d] = [J_k, \tilde d], \]

defining $\tilde d(u) = d(u; F(\cdot;\hat\theta), \tilde F)$. For m large enough $\hat d_{0,m}(u)$ equals $\tilde d(u)$ and $\mathrm{IR}_0(\hat d_{0,m})$
increases to the test statistic of Step 4, $\mathrm{IR}_0(d(u; F(\cdot;\hat\theta), \tilde F))$.

Step 6: Rejection simulation nonparametric estimation of F. Use an order determining
criterion to determine an order $\hat m$ with the properties: if $\hat m = 0$, accept $H_0$; if one rejects
$H_0$, use $\hat d_{0,\hat m}(u)$ as the density to be used in the rejection method of simulating a random
sample from F. The combination of $F(\cdot;\hat\theta)$ and $\hat d_{0,\hat m}(u)$ is regarded as an estimator $F^*$.
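The rejection method mentioned in Step 6 can be sketched as follows: propose Y from the fitted parametric model, transform to $U = F(Y;\hat\theta)$, and accept Y with probability $\hat d(U)/\sup \hat d$, so that accepted values have density $f(y;\hat\theta)\,\hat d(F(y;\hat\theta))$. The normal model, the linear comparison density, and all names below are assumptions made for this illustration only.

```python
import math
import numpy as np

def normal_cdf(y, mu, sigma):
    """Distribution function of the assumed normal model, evaluated elementwise."""
    z = (np.asarray(y, float) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + np.vectorize(math.erf)(z))

def rejection_sample(n_proposals, mu, sigma, d_hat, d_max, seed=0):
    """Simulate from the density f(y) * d_hat(F(y)) by the rejection method."""
    rng = np.random.default_rng(seed)
    y = rng.normal(mu, sigma, size=n_proposals)      # proposals from the parametric model
    u = normal_cdf(y, mu, sigma)                     # transform to the unit interval
    accept = rng.uniform(size=n_proposals) * d_max <= d_hat(u)
    return y[accept]

# Hypothetical smooth comparison density estimate: positive on (0,1), integrates to 1.
slope = 0.5 * math.sqrt(3.0)
d_hat = lambda u: 1.0 + slope * (2.0 * u - 1.0)
d_max = 1.0 + slope

draws = rejection_sample(10000, 0.0, 1.0, d_hat, d_max)
acceptance_rate = draws.size / 10000
mean_u = normal_cdf(draws, 0.0, 1.0).mean()
```

The acceptance rate is about $1/\sup \hat d$, so a comparison density estimate with tall peaks makes the method less efficient; the mean of the transformed accepted values should be close to $\int_0^1 u\, \hat d(u)\, du$.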

We propose that order determining criteria should be regarded as providing density
estimators which require further goodness of fit tests. We propose (as an open research
problem) a method for testing if a smooth estimator $\hat d(u)$ adequately smooths a raw
estimator $\tilde d(u)$: test if the ratio $\tilde d(u)/\hat d(u)$ has as its best smoother a constant function.
6. One Sample Discrete Data Analysis


Step 1: Identify a parametric family of probability mass functions $p(x;\theta)$ to model the
sample probability mass function $\tilde p(x)$.

Step 2: Parameter estimation. Maximum likelihood estimator $\hat\theta$ can be obtained by
minimizing

\[ \mathrm{IR}_{-1}(d(u; \tilde F, F(\cdot;\theta))) = (-2) \sum_x \log\{p(x;\theta)/\tilde p(x)\}\, \tilde p(x). \]

A parametric estimator of p is $\hat p(x) = p(x;\hat\theta)$. Minimum chi-square estimation uses the
modified chi-squared distance

\[ C_1(d(u; \tilde F, F(\cdot;\theta))) = \sum_x \{(p(x;\theta)/\tilde p(x)) - 1\}^2\, \tilde p(x). \]

Step 3: Parametric hypothesis testing. To test a hypothesis $H_0$ about the parameter
$\theta$, let $\hat\theta_{H_0}$ denote the minimum modified chi-square estimator of $\theta$ under $H_0$; equivalent
to likelihood ratio tests is the test statistic

\[ \mathrm{IR}_1(d(u; \tilde F, F(\cdot;\hat\theta_{H_0}))) - \mathrm{IR}_1(d(u; \tilde F, F(\cdot;\hat\theta))). \]

Step 4: Goodness of fit test of $H_0: p = p(\cdot;\hat\theta)$ or equivalently $H_0: d(u; F(\cdot;\hat\theta), F) = 1$.
Test the significance of the difference from zero of

\[ \mathrm{IR}_1(d(u; F(\cdot;\hat\theta), \tilde F)) = \mathrm{IR}_{-2}(d(u; \tilde F, F(\cdot;\hat\theta))). \]

Step 5: Maximum entropy goodness of fit tests and estimators $\hat d_{0,m}(u)$ of
$d(u; F(\cdot;\hat\theta), F)$ are obtained by minimizing $\mathrm{IR}_0(\hat d)$ among densities $\hat d(u)$ satisfying, for
$k = 1, \ldots, m$ and specified score functions $J_k(u)$,

\[ [J_k, \hat d] = [J_k, \tilde d], \]

defining $\tilde d(u) = d(u; F(\cdot;\hat\theta), \tilde F)$. For m large enough $\hat d_{0,m}(u)$ equals $\tilde d(u)$ and $\mathrm{IR}_0(\hat d_{0,m})$
increases to a test statistic (alternative to that of Step 4) $\mathrm{IR}_0(d(u; F(\cdot;\hat\theta), \tilde F))$.

Step 6: Rejection simulation nonparametric estimation of F. Use an order determining
criterion to determine an order $\hat m$ with the properties: if $\hat m = 0$, accept $H_0$; if one rejects
$H_0$, use $\hat d_{0,\hat m}(u)$ as the density to be used in the rejection method of simulating a random
sample from F. The combination of $F(\cdot;\hat\theta)$ and $\hat d_{0,\hat m}(u)$ is regarded as an estimator $F^*$.

7.  Multi-Sample  Data  Analysis  and  Tests  of  Homogeneity 

Multi-sample data arises when one observes the values of a variable Y in several populations
which can be regarded as indexed by a variable X. One can therefore regard
multi-samples as independent observations of a bivariate random variable (X,Y). Conventional
multi-sample statistical analysis is concerned with testing the hypothesis $H_0$ of
homogeneity, which we express

\[ H_0: Y \text{ is independent of } X. \]

To formulate $H_0$ in terms of comparison density functions let us note that non-parametric
statistics are based on replacing the response Y by its rank transform, which
in the population is

\[ W = F_Y(Y). \]

The sample rank transform is

\[ \tilde W = \tilde P_Y(Y), \]

where $\tilde P_Y(y)$ is the sample mid-distribution function of Y defined by

\[ \tilde P_Y(y) = \tilde F_Y(y) - .5\, \tilde p_Y(y), \]

in terms of the sample distribution function $\tilde F_Y$ and sample probability mass function $\tilde p_Y$.

The sample quantile function of the $\tilde W$ values which are rank transforms of Y values
associated with a fixed value of X is an estimator of the conditional quantile function of
W

\[ Q_{W|X}(u) = F_{Y|X}(F_Y^{-1}(u)) = D(u; F_Y, F_{Y|X}). \]
The innovation of our approach is a new type of linear rank statistics which are estimators
of the form, called sample components,

\[ [J_k(u), d(u; \tilde F_Y, \tilde F_{Y|X})] \]

of population components

\[ [J_k(u), d(u; F_Y, F_{Y|X})] \]

for suitable score functions $J_k(u)$.

Score functions $J(u) = \varphi(F_Y^{-1}(u))$ satisfy

\[ [J(u), d(u; F_Y, F_{Y|X})] = E_{Y|X}[\varphi(Y)] = E_{Y|X}[J(F_Y(Y))]; \]

this component is the conditional mean given X of $J(W)$, where $W = F_Y(Y)$ is the
rank transform of Y. A Wilcoxon statistic corresponds to $J(u) = (12)^{.5}(u - .5)$, whose
sample component is equivalent to a rank-sum statistic. The conditional mean $E[Y|X]$ is
a component with score function $J(u) = Q_Y(u)$.

The traditional approach to multiple sample tests of homogeneity is to test the significant
difference from zero of the sample components. We propose that a comprehensive way
to test the homogeneity hypothesis $H_0$ is to estimate the comparison density $d(u; F_Y, F_{Y|X})$
for each value of X, and various chi-square statistics

\[ C_{\lambda,X} = C_\lambda(d(u; F_Y, F_{Y|X})). \]
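A sample component with the Wilcoxon score function can be computed directly from the mid-distribution rank transform of the pooled sample. The sketch below uses three small hypothetical samples (with ties) and illustrative function names; it computes only the raw components, without the normalization needed to refer them to a standard normal distribution.

```python
import numpy as np

def mid_distribution(pooled):
    """Return P~(y) = F~(y) - .5 p~(y), the sample mid-distribution function."""
    pooled = np.asarray(pooled, float)
    def P(y):
        y = np.atleast_1d(np.asarray(y, float))
        below = (pooled[None, :] < y[:, None]).mean(axis=1)    # proportion strictly below y
        equal = (pooled[None, :] == y[:, None]).mean(axis=1)   # proportion tied with y
        return below + 0.5 * equal
    return P

def wilcoxon_component(sample, P):
    """Sample mean of J(W~) with J(u) = sqrt(12) (u - .5) and W~ = P~(Y)."""
    return float(np.mean(np.sqrt(12.0) * (P(sample) - 0.5)))

# Hypothetical samples indexed by the value of X.
samples = {"I": [1.0, 2.0, 2.0, 3.5], "II": [2.0, 4.0, 5.0], "III": [3.5, 6.0, 7.0]}
pooled = np.concatenate([np.asarray(v, float) for v in samples.values()])
P = mid_distribution(pooled)
components = {name: wilcoxon_component(values, P) for name, values in samples.items()}
```

Because the mid-distribution transform of the pooled sample always averages to .5, the components weighted by the sample proportions sum to zero; a negative component indicates a sample that tends to lie below the pooled sample.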

8.  Bivariate  Data  Analysis 

Another approach to understanding the role of chi-squared measures of the difference
of the comparison density from the uniform $d(u) = 1$, $0 < u < 1$, is to regard X and
Y as random variables and express the homogeneity hypothesis $H_0$ as a hypothesis of
independence:

\[ H_0: F_{Y|X}(y|x) = F_Y(y) \text{ for all } y \text{ and } x. \]

Bivariate data analysis can be unified by the dependence density function defined for
$0 < u_1, u_2 < 1$ by

\[ d(u_1, u_2) = \frac{f_{X,Y}(Q_X(u_1), Q_Y(u_2))}{f_X(Q_X(u_1))\, f_Y(Q_Y(u_2))}. \]

Traditional maximum likelihood estimators (and EM algorithms) for response variables
Y with covariates X can be based on the information measure of dependence, called
mutual information, defined for continuous random variables X and Y by

\[ I(Y|X) = \mathrm{IR}_{-1}(F_{X,Y}; F_X F_Y) = \mathrm{IR}_{-1}(f_{X,Y}; f_X f_Y)
   = (-2) \int\!\!\int \log\{f_X(x) f_Y(y)/f_{X,Y}(x,y)\}\, f_{X,Y}(x,y)\, dx\, dy. \]

The fundamental relation usually used to study the information about Y in X measured
by $I(Y|X)$ is

\[ I(Y|X) = H(Y) - H(Y|X), \]

defining $H(Y|X) = E_X H(f_{Y|X})$, called conditional entropy of Y given X.

We obtain a fundamental relation expressing mutual information in terms of the comparison
density function of $F_Y$ and $F_{Y|X}$, which measures how well $f_Y$ models $f_{Y|X}$:

\[ I(Y|X) = E_X\, \mathrm{IR}_0(d(u; F_Y, F_{Y|X})) = E_X\, C_0(d(u; F_Y, F_{Y|X})). \]

This is proved by writing

\[ I(Y|X) = 2 \int dx\, f_X(x) \int dy\, f_Y(y)\, \log\{f_{Y|X}(y|x)/f_Y(y)\}\, \{f_{Y|X}(y|x)/f_Y(y)\}
   = 2 E_X \int_0^1 \log\{d(u; F_Y, F_{Y|X})\}\, d(u; F_Y, F_{Y|X})\, du. \]

Traditional chi-squared test statistics satisfy

\[ C_\lambda(F_{X,Y}; F_X F_Y) = E_X\, C_\lambda(d(u; F_{Y|X}, F_Y)). \]

We define the chi-square divergence (of index $\lambda$) of Y given X to be

\[ C_\lambda(Y|X) = E_X\, C_\lambda(d(u; F_Y, F_{Y|X})). \]

Traditional chi-square statistics for discrete data use $\lambda = 1$. Read and Cressie (1988)
recommend $\lambda = 2/3$.
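For a contingency table these relations reduce to finite sums and can be verified directly. The following sketch assumes a hypothetical 3 by 2 table of counts with rows indexed by Y and columns by X, and all cells positive; it checks that mutual information is the X-average of $\mathrm{IR}_0$ of the comparison densities, and that $C_1(Y|X)$ is the Pearson chi-square statistic divided by the sample size.

```python
import numpy as np

def conditional_divergences(table):
    """Per-column IR_0 and C_1 of d(u; F_Y, F_{Y|X=k}) for a table of counts."""
    p = np.asarray(table, float)
    p = p / p.sum()                        # joint pmf; rows index Y, columns index X
    p_y = p.sum(axis=1)
    p_x = p.sum(axis=0)
    d = p / np.outer(p_y, p_x)             # heights p_{Y|X}(j|k) / p_Y(j), on intervals of width p_Y(j)
    ir0 = 2.0 * (p_y[:, None] * d * np.log(d)).sum(axis=0)
    c1 = (p_y[:, None] * (d - 1.0) ** 2).sum(axis=0)
    return p, p_x, p_y, ir0, c1

table = [[30, 10], [20, 20], [10, 30]]     # hypothetical counts
p, p_x, p_y, ir0, c1 = conditional_divergences(table)

mutual_information = float((p_x * ir0).sum())        # E_X IR_0(d(u; F_Y, F_{Y|X}))
chi_square_divergence = float((p_x * c1).sum())      # C_1(Y|X)

expected = np.outer(p_y, p_x)
direct_information = float(2.0 * (p * np.log(p / expected)).sum())
pearson_over_n = float(((p - expected) ** 2 / expected).sum())
```

Multiplying `chi_square_divergence` by the total count recovers the usual Pearson statistic for the table.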

The notation is at hand to state our comparison density approach to multi-sample
data analysis and tests of homogeneity:

Step 1: Form raw estimates for each value of X

\[ \tilde d_X(u) = d(u; \tilde F_Y, \tilde F_{Y|X}), \]

which is computed using the formula for the comparison density function of sample discrete
distributions.

Step 2: Form and test significance of difference from zero of various components

\[ [J_k, \tilde d_X] \]

for suitable score functions $J_k(u)$.

Step 3: Estimators of $d_X(u) = d(u; F_Y, F_{Y|X})$ by minimum Renyi information estimators

\[ \hat d_{X,\lambda,m}(u) \]

subject to constraints

\[ [J_k, \hat d_{X,\lambda,m}] = [J_k, \tilde d_X], \quad k = 1, \ldots, m. \]

Step 4: Smooth chi-squared tests of $H_0$ are based on smooth density estimators substituted
in the population formulas

\[ C_\lambda(Y|X) = E_X[C_{\lambda,X}], \]
\[ C_{\lambda,X} = C_\lambda(d(u; F_Y, F_{Y|X})). \]

Further one can disaggregate $C_{\lambda,X}$ into statistics $C_{\lambda,Y,X}$ called "hanging Chi-squares".
They are asymptotically distributed as Chi-squared with 1 degree of freedom. If one
rejects the hypothesis of homogeneity, the hanging Chi-squares help identify the sources
of rejection.

This  outline  requires  many  details  and  examples  to  be  understandable  by  statisticians 
not  used  to  the  point  of  view  of  statistical  culture. 

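Contingency table data analysis. For $0 < p < 1$, define $\mathrm{ODDS}(p) = p/(1 - p)$. For an r
by c contingency table, total sample size N, one forms sample statistics

\[ \tilde C_\lambda(Y|X) = \sum_{k=1}^{c} \{1 - \tilde p_X(k)\}\, \tilde C_{k,\lambda}, \]

\[ \tilde C_{k,\lambda} = \sum_{j=1}^{r} \tilde C_{j,k,\lambda}\, \{1 - \tilde p_Y(j)\}, \]

\[ \tilde C_{j,k,\lambda} = \mathrm{ODDS}(\tilde p_X(k))\, \mathrm{ODDS}(\tilde p_Y(j))\, B_\lambda\big(\tilde p_{Y|X}(j|k)/\tilde p_Y(j)\big). \]

Asymptotic distributions of test statistics:

$(N - 1)\, \tilde C_\lambda(Y|X)$ is Chi-square $((r - 1)(c - 1))$
$(N - 1)\, \tilde C_{k,\lambda}$ is Chi-square $(r - 1)$
$(N - 1)\, \tilde C_{j,k,\lambda}$ is Chi-square $(1)$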


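Multiple-Sample Goodness of Fit Tests. One can associate a weighted orthogonal series
density estimator $d^*(u; F_Y, F_{Y|X})$ with each value of X, using suitable complete orthonormal
functions $\varphi_j(u)$ and weights $w_j$.

\[ C^*(Y|X) = \sum_{k=1}^{c} \{1 - \tilde p_X(k)\}\, C^*_k, \]

\[ C^*_k = \mathrm{ODDS}(\tilde p_X(k)) \int_0^1 \{d^*(u; F_Y, F_{Y|X=k}) - 1\}^2\, du
   = \sum_{j=1}^{\infty} w_j^2\, \mathrm{ODDS}(\tilde p_X(k))\, \big[\varphi_j(u), \tilde d(u; F_Y, F_{Y|X=k})\big]^2. \]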


Cramer-von Mises Goodness of Fit Test:

\[ w_j = 1/j\pi, \quad \varphi_j(u) = 2^{.5} \cos(j\pi u), \]

\[ \int_0^1 \{D(u) - u\}^2\, du = \sum_{j=1}^{\infty} w_j^2\, [\varphi_j, d]^2. \]
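The Cramer-von Mises representation can be checked numerically for a known density. The sketch below takes the hypothetical choice $d(u) = 2u$, $D(u) = u^2$, for which the integral of $\{D(u) - u\}^2$ equals 1/30, and compares it with the weighted sum of squared cosine components computed by midpoint quadrature; the truncation at 50 terms and the grid size are arbitrary choices for the illustration.

```python
import numpy as np

def cramer_von_mises_two_ways(d, D, n_terms=50, n_grid=20000):
    """Return (integral of (D(u)-u)^2, sum_j {w_j [phi_j, d]}^2) with w_j = 1/(j pi)."""
    u = (np.arange(n_grid) + 0.5) / n_grid            # midpoint quadrature nodes on (0,1)
    integral = np.mean((D(u) - u) ** 2)
    series = 0.0
    for j in range(1, n_terms + 1):
        phi_j = np.sqrt(2.0) * np.cos(j * np.pi * u)  # orthonormal cosine score function
        component = np.mean(phi_j * d(u))             # [phi_j, d]
        series += (component / (j * np.pi)) ** 2
    return float(integral), float(series)

integral, series = cramer_von_mises_two_ways(lambda u: 2.0 * u, lambda u: u ** 2)
```

The weights $1/(j\pi)^2$ decay quickly, which is why the statistic is dominated by its first few components.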


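Anderson-Darling Goodness of Fit Test:

\[ w_j = \{1/j(j + 1)\}^{.5}, \quad \varphi_j(u) = (2j + 1)^{.5}\, P_j(2u - 1), \]

\[ \int_0^1 \big\{ (D(u) - u)^2 / u(1 - u) \big\}\, du = \sum_{j=1}^{\infty} w_j^2\, [\varphi_j, d]^2; \]

$P_j(t)$ are Legendre polynomials on $[-1,1]$.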

Hermite Polynomial Goodness of Fit Test:

\[ w_j = 1/j, \quad \varphi_j(u) = (j!)^{-.5}\, H_j(\Phi^{-1}(u)); \]

$H_j(x)$ are Hermite polynomials.

9.  Examples  of  One  Sample  and  Multi-Sample  Continuous  Data  Analysis: 

National Bureau of Standards NB10 Measurements: Freedman, Pisani, Purves in their
textbook on Statistics (p. 94) report 100 measurements of the 10 gram check-weight NB10
made at the National Bureau of Standards. They report: "The normal curve does not
fit at all well. The normal curve does fit the data with three outliers removed. The
normal curve fitted to these measurements has an average of 404 micrograms below 10
grams, and a standard deviation of about 4 micrograms. But in a small percentage of
cases, the measurements are quite a bit farther away from the average than the normal
curve suggests. The overall standard deviation of 6 micrograms is a compromise between
the standard deviation of the main part of the histogram (4 micrograms) and the three
outliers, representing deviations of 18, -30, and 32 micrograms. In careful measurement
work, a small percentage of outliers is expected. The only unusual aspect of the NB10
data is that the National Bureau of Standards reported its outliers; many investigators
don't. Realistic performance parameters require the acceptance of all data that cannot be
rejected for cause."
The NB10 data illustrates the statistical analysis strategy that we propose be routinely
applied to data. Step 1. Specify a parametric probability model for the data (here the
model is normal). Step 2. Estimate parameters of the model (here mean and standard
deviation) to be 10 grams minus 404 micrograms and 6 micrograms respectively. Step 2*. Robust
parameter estimation by Renyi information of index between 0 and 1 obtains as estimators
of a normal model (fitted to the part of the data that can be well fitted by a normal model)
the same mean and a standard deviation of 4 micrograms. Step 4. Goodness of fit test of
normality by traditional tests. Step 5. Maximum entropy estimator of comparison density
d(u; normal model, data) clearly indicates the nature of the data: a poor fit of normal
model to data. The shape of $\hat d(u)$ in the interior of the interval (0,1) can be interpreted as the expected
curve if $\hat d(u)$ estimates

\[ d(u; N(0,(6)^2), N(0,(4)^2)) = \kappa \exp\big\{ -.5\, (\kappa^2 - 1)\, |\Phi^{-1}(u)|^2 \big\}, \quad \kappa = 6/4. \]

Peaks of $\hat d(u)$ at u = 0, 1 indicate longer tails than normal. In general, one must decide
whether to consider these tails in $\tilde d(u)$ as outliers or as evidence that a longer tailed
distribution than the normal should be used to model the data. In Figure 2 two graphs
illustrate the comparison density estimation process: the raw estimator $\tilde d(u)$ superimposed
on a smooth estimator $\hat d(u)$; the exponential model smooth estimator $\hat d_{0,4}$, the orthogonal
polynomial estimator, and a naive step function estimator representing increments
of $\tilde D(u)$ on 8 equal subintervals. Diagnostic tools at step 1 which help identify probability
models for the data are illustrated by an IQQ plot of the sample quantile function of the data
versus the quantile function of a normal with density $f(x) = \exp(-\pi x^2)$. The informative
quantile function of the sample is defined by $\widetilde{QI}(u) = \{\tilde Q(u) - \tilde Q(.5)\}/2\{\tilde Q(.75) - \tilde Q(.25)\}$.
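The closed form for the comparison density of two centered normal distributions can be checked against the definition $d(u;F,G) = g(F^{-1}(u))/f(F^{-1}(u))$. The sketch below uses Python's standard-library `statistics.NormalDist` with the standard deviations 6 and 4 from the NB10 discussion; the function names are illustrative.

```python
import math
from statistics import NormalDist

def d_direct(u, sigma_f, sigma_g):
    """d(u; F, G) = g(F^{-1}(u)) / f(F^{-1}(u)) for F = N(0, sigma_f^2), G = N(0, sigma_g^2)."""
    F, G = NormalDist(0.0, sigma_f), NormalDist(0.0, sigma_g)
    y = F.inv_cdf(u)
    return G.pdf(y) / F.pdf(y)

def d_formula(u, sigma_f, sigma_g):
    """kappa * exp{-.5 (kappa^2 - 1) |Phi^{-1}(u)|^2} with kappa = sigma_f / sigma_g."""
    kappa = sigma_f / sigma_g
    z = NormalDist().inv_cdf(u)
    return kappa * math.exp(-0.5 * (kappa ** 2 - 1.0) * z ** 2)

points = [0.01, 0.25, 0.5, 0.9]
differences = [abs(d_direct(u, 6.0, 4.0) - d_formula(u, 6.0, 4.0)) for u in points]

# Midpoint-rule check that the comparison density integrates to 1 over (0,1).
n = 2000
total = sum(d_formula((i + 0.5) / n, 6.0, 4.0) for i in range(n)) / n
```

With $\kappa = 6/4$ the curve peaks at 1.5 in the middle of the unit interval and falls toward zero at both ends, the shape expected when the fitted model has a larger standard deviation than the bulk of the data.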

Breaking Stress of Beam: Cheng and Stephens (1989) give a data set of breaking stress
of 41 beam specimens cut from a single carbon block of graphite H590, and discuss goodness
of fit tests of the hypothesis that the data is normal. Let $F(\cdot;\hat\theta)$ denote the normal
distribution with maximum likelihood estimated value of $\theta$. They show that Moran's
statistic, which is equivalent to $\mathrm{IR}_0(d(u; F(\cdot;\hat\theta), \tilde F))$, "correctly" rejects the hypothesis that
the sample is normal, in contrast to more traditional empirical distribution based statistics
(such as Kolmogorov-Smirnov and Cramer-von Mises) which accept the hypothesis of
normality for the sample tested. The comparison density estimation approach indicates the
nature of the data: an excellent fit of the normal model in the interior of the interval (0,1), but peaks at
u = 0, 1 indicate outliers or long tails (clearly evident in the stem and leaf table of the data).
One conjectures that a symmetric extreme value distribution would be a more appropriate
model. Figure 3 illustrates the comparison density estimation process for a normal model
$F(\cdot;\hat\theta)$. The graph of $D(u; F(\cdot;\hat\theta), \tilde F)$ is graphically well fitted by a uniform distribution,
and therefore passes traditional goodness of fit tests. The raw estimator $d(u; F(\cdot;\hat\theta), \tilde F)$
is superimposed on a smooth estimator. The exponential model smooth estimator $\hat d(u)$ is
superimposed on a step function estimator computed from increments of $D(u; F(\cdot;\hat\theta), \tilde F)$
over 8 sub-intervals.

Cheng and Stephens Break Stress Data
(Stem and Leaf)

27 | .55
28 |
29 | .89
30 | .07 .65
31 | .23 .53 .53 .82
32 | .23 .28 .69 .98
33 | .28 .28 .74 .74 .86 .86 .86
34 | .15 .15 .15 .44 .62 .74 .74
35 | .03 .03 .32 .44 .61 .61 .73 .90
36 | .20 .78
37 | .07 .36 .36 .36
38 |
39 |
40 | .28

Multisample of ratio of assessed value to sale price of residential property: To illustrate
the comparison density approach to testing multi-samples for homogeneity, we consider
data analysed by Boos (1986) on ratio of assessed value to sale price of residential property
in Fitchburg, Mass., 1979. The samples (denoted I, II, III, IV) represent dwellings in the
categories single-family, two-family, three-family, four or more families. The sample sizes
(54, 43, 31, 28) are proportions .346, .276, .199, .179 of the size 156 of the pooled sample.
We interpret these proportions as $\tilde p_X(k)$, $k = 1, \ldots, 4$. We compute Legendre, cosine, and
Hermite components up to order 4 of the 4 samples; they are asymptotically
standard normal. We consider components greater than 2 (3) in absolute value to be
significant (very significant).

Legendre,  cosine,  and  Hermite  components  are  very  significant  only  for  sample  I, 
order  1  (-4.06,  -4.22,  -3.56  respectively).  Legendre  components  are  significant  for  sample 
IV,  orders  1  and  2  (2.19,  2.31).  Cosine  components  are  significant  for  sample  IV,  orders  1 
and  2  (2.36,  2.23)  and  sample  III,  order  1  (2.05).  Hermite  components  are  significant  for 
sample  IV,  orders  2  and  3  (2.7  and  -2.07). 

Conclusions are that the four samples are not homogeneous (that is, they do not all have the
same distribution). Samples I and IV are significantly different from the pooled sample. Estimators of
the comparison density provide a substantive conclusion; they show that sample I is more
likely to have lower values than the pooled sample, and sample IV is more likely to have
higher values, suggesting that one-family homes are underassessed and four-family homes
are overassessed, while two- and three-family homes are fairly assessed.

When  one  compares  components  with  traditional  empirical  distribution  based  tests 
one  concludes  that  the  insights  are  provided  by  the  linear  rank  statistics  of  orthogonal 
polynomials  rather  than  by  portmanteau  statistics  of  Cramer-von  Mises  or  Anderson- 
Darling  type.  Comparison  density  functions,  which  compare  each  sample  with  the  pooled 
sample,  can  provide  the  most  substantive  information. 
Figure 1

To understand the shapes of comparison density functions, graphs of $d(u;G,F)$ and
$d(u;F,G)$ for two cases. Case 1: F normal (median 0, density at median 1), G Cauchy
(median 0, density at median 1). Case 2: F normal (median 0, density at median 1), G
symmetric extreme value (median 0, density at median 1).

[Panel title: d(u; Symmetric Extreme Value, Normal)]

[Panel title: Raw $\tilde d(u)$ to test normality, smooth by $\hat d(u)$]

[Panel title: Estimators of d(u; normal, data): Orthogonal Polynomial; Exponential Model (graph closest to graph of step function)]

[Panel title: Raw $\tilde d(u)$ to test normality, smooth by orthogonal polynomial $\hat d(u)$]

Figure 4

Ratio of assessed price to sale price of residential property.
For samples I and IV, sample comparison distribution function $\tilde D(u)$.

For samples I and IV, sample comparison density $\tilde d(u)$, sample quartile density $\widetilde{dQ}(u)$
(square wave), nonparametric density estimator $\hat d(u)$.

For samples I and IV, Legendre, cosine, and Hermite orthogonal polynomial estimator of
order 4 of the comparison density, denoted $d_4(u)$, compared to sample quartile density
$\widetilde{dQ}(u)$.